Course Project - Feature Engineering - Phase 1
Professor: Ing. Preng Biba Solares
Teaching assistant: Ing. Jorge Alberto Osoy Barrera
Course: Statistical Learning
Participating students: Jordi Gian Carlo Chajón López (student ID 23000477) and Felipe Carlos Escoto Castro (student ID 23003984).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
dataset = pd.read_csv("global-data-on-sustainable-energy.csv")
dataset.head()
| | Entity | Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | ... | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Density (P/Km2) | Land Area(Km2) | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2000 | 1.613591 | 6.2 | 9.22 | 20000.0 | 44.99 | 0.16 | 0.0 | 0.31 | ... | 302.59482 | 1.64 | 760.000000 | NaN | NaN | NaN | 60 | 652230 | 33.93911 | 67.709953 |
| 1 | Afghanistan | 2001 | 4.074574 | 7.2 | 8.86 | 130000.0 | 45.60 | 0.09 | 0.0 | 0.50 | ... | 236.89185 | 1.74 | 730.000000 | NaN | NaN | NaN | 60 | 652230 | 33.93911 | 67.709953 |
| 2 | Afghanistan | 2002 | 9.409158 | 8.2 | 8.47 | 3950000.0 | 37.83 | 0.13 | 0.0 | 0.56 | ... | 210.86215 | 1.40 | 1029.999971 | NaN | NaN | 179.426579 | 60 | 652230 | 33.93911 | 67.709953 |
| 3 | Afghanistan | 2003 | 14.738506 | 9.5 | 8.09 | 25970000.0 | 36.66 | 0.31 | 0.0 | 0.63 | ... | 229.96822 | 1.40 | 1220.000029 | NaN | 8.832278 | 190.683814 | 60 | 652230 | 33.93911 | 67.709953 |
| 4 | Afghanistan | 2004 | 20.064968 | 10.9 | 7.75 | NaN | 44.24 | 0.33 | 0.0 | 0.56 | ... | 204.23125 | 1.20 | 1029.999971 | NaN | 1.414118 | 211.382074 | 60 | 652230 | 33.93911 | 67.709953 |
5 rows × 21 columns
Exploratory Analysis of the Original Dataset
# Show information about the dataset (info() prints directly and returns None)
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3648 entries, 0 to 3647
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Entity  3648 non-null  object
 1   Year  3648 non-null  int64
 2   Access to electricity (% of population)  3648 non-null  float64
 3   Access to clean fuels for cooking  3648 non-null  float64
 4   Renewable-electricity-generating-capacity-per-capita  3648 non-null  float64
 5   Financial flows to developing countries (US $)  3648 non-null  float64
 6   Renewable energy share in the total final energy consumption (%)  3648 non-null  float64
 7   Electricity from fossil fuels (TWh)  3648 non-null  float64
 8   Electricity from nuclear (TWh)  3648 non-null  float64
 9   Electricity from renewables (TWh)  3648 non-null  float64
 10  Low-carbon electricity (% electricity)  3648 non-null  float64
 11  Primary energy consumption per capita (kWh/person)  3648 non-null  float64
 12  Energy intensity level of primary energy (MJ/$2017 PPP GDP)  3648 non-null  float64
 13  Value_co2_emissions_kt_by_country  3648 non-null  float64
 14  Renewables (% equivalent primary energy)  3648 non-null  float64
 15  gdp_growth  3648 non-null  float64
 16  gdp_per_capita  3648 non-null  float64
 17  Density (P/Km2)  3648 non-null  object
 18  Land Area(Km2)  3648 non-null  int64
 19  Latitude  3648 non-null  float64
 20  Longitude  3648 non-null  float64
dtypes: float64(17), int64(2), object(2)
memory usage: 598.6+ KB
Note: every column shows 3648 non-null values here, which suggests this output was captured after the imputation cells below had already run in the same session. The raw file does contain missing values, as the following sections show.
# Summary statistics of the dataset
print(dataset.describe())
Year Access to electricity (% of population) \
count 3648.000000 3648.000000
mean 2010.041118 78.933693
std 6.052776 30.238162
min 2000.000000 1.252269
25% 2005.000000 59.916941
50% 2010.000000 98.272340
75% 2015.000000 100.000000
max 2020.000000 100.000000
Access to clean fuels for cooking \
count 3648.000000
mean 63.255504
std 38.133777
min 0.000000
25% 25.862500
50% 78.875000
75% 100.000000
max 100.000000
Renewable-electricity-generating-capacity-per-capita \
count 3648.000000
mean 92.493616
std 213.395777
min 0.000000
25% 8.380000
50% 32.880000
75% 67.472500
max 3060.190000
Financial flows to developing countries (US $) \
count 3.648000e+03
mean 4.353562e+07
std 1.998022e+08
min 0.000000e+00
25% 5.665000e+06
50% 5.665000e+06
75% 5.665000e+06
max 5.202310e+09
Renewable energy share in the total final energy consumption (%) \
count 3648.000000
mean 32.142379
std 29.168672
min 0.000000
25% 7.095000
50% 23.270000
75% 52.612500
max 96.040000
Electricity from fossil fuels (TWh) Electricity from nuclear (TWh) \
count 3648.000000 3648.000000
mean 69.996209 12.989315
std 347.131749 71.786293
min 0.000000 0.000000
25% 0.300000 0.000000
50% 2.970000 0.000000
75% 26.527500 0.000000
max 5184.130000 809.410000
Electricity from renewables (TWh) \
count 3648.000000
mean 23.845069
std 104.157507
min 0.000000
25% 0.050000
50% 1.470000
75% 9.560000
max 2184.940000
Low-carbon electricity (% electricity) \
count 3648.000000
mean 36.708904
std 34.129226
min 0.000000
25% 3.030303
50% 27.910000
75% 64.038130
max 100.000010
Primary energy consumption per capita (kWh/person) \
count 3648.000000
mean 25747.285360
std 34777.415694
min 0.000000
25% 3116.636825
50% 13118.841000
75% 33897.402500
max 262585.700000
Energy intensity level of primary energy (MJ/$2017 PPP GDP) \
count 3648.000000
mean 5.250461
std 3.438690
min 0.110000
25% 3.220000
50% 4.300000
75% 5.880000
max 32.570000
Value_co2_emissions_kt_by_country \
count 3.648000e+03
mean 1.423831e+05
std 7.285451e+05
min 1.000000e+01
25% 2.509557e+03
50% 1.050000e+04
75% 5.136250e+04
max 1.070722e+07
Renewables (% equivalent primary energy) gdp_growth gdp_per_capita \
count 3648.000000 3648.000000 3648.000000
mean 8.651135 3.441471 12613.230060
std 10.051457 5.434772 19077.099547
min 0.000000 -62.075920 111.927225
25% 6.290000 1.651476 1464.841885
50% 6.290000 3.440000 4578.630000
75% 6.290000 5.543696 13993.509465
max 86.836586 123.139555 123514.196700
Land Area(Km2) Latitude Longitude
count 3.648000e+03 3648.000000 3648.000000
mean 6.332135e+05 18.246388 14.822695
std 1.585519e+06 24.159232 66.348148
min 2.100000e+01 -40.900557 -175.198242
25% 2.571300e+04 3.202778 -11.779889
50% 1.176000e+05 17.189877 19.145136
75% 5.131200e+05 38.969719 46.199616
max 9.984670e+06 64.963051 178.065032
1. Determine which columns have missing values (NA or null)
col_con_nan = []
for col in dataset.columns:
    # Fraction of missing values in this column
    porcentaje_faltante = dataset[col].isnull().mean()
    if porcentaje_faltante > 0:
        col_con_nan.append(col)
col_con_nan
['Access to electricity (% of population)', 'Access to clean fuels for cooking', 'Renewable-electricity-generating-capacity-per-capita', 'Financial flows to developing countries (US $)', 'Renewable energy share in the total final energy consumption (%)', 'Electricity from fossil fuels (TWh)', 'Electricity from nuclear (TWh)', 'Electricity from renewables (TWh)', 'Low-carbon electricity (% electricity)', 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)', 'Value_co2_emissions_kt_by_country', 'Renewables (% equivalent primary energy)', 'gdp_growth', 'gdp_per_capita']
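The same list can be obtained without an explicit loop. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and stand in for the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Year": [2000, 2001, 2002],
    "gdp_growth": [1.5, np.nan, 3.0],
    "gdp_per_capita": [np.nan, np.nan, 500.0],
})

# isnull().any() is True for every column with at least one missing value.
mask = toy.isnull().any()
cols_with_nan = toy.columns[mask].tolist()
print(cols_with_nan)
```

Applied to the real data, `dataset.columns[dataset.isnull().any()].tolist()` should give the same columns as the loop above.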
2. Determine the proportion of missing values for each affected column and show it in a bar chart
porcentaje_nulos = dataset[col_con_nan].isnull().mean()
porcentaje_nulos_redondeado = round(porcentaje_nulos * 100, 2)
porcentaje_nulos_redondeado
Access to electricity (% of population)                               0.25
Access to clean fuels for cooking                                     4.61
Renewable-electricity-generating-capacity-per-capita                 25.52
Financial flows to developing countries (US $)                       57.24
Renewable energy share in the total final energy consumption (%)      5.32
Electricity from fossil fuels (TWh)                                   0.58
Electricity from nuclear (TWh)                                        3.45
Electricity from renewables (TWh)                                     0.58
Low-carbon electricity (% electricity)                                1.15
Energy intensity level of primary energy (MJ/$2017 PPP GDP)           5.65
Value_co2_emissions_kt_by_country                                    11.71
Renewables (% equivalent primary energy)                             58.55
gdp_growth                                                            8.66
gdp_per_capita                                                        7.70
dtype: float64
fig, ax = plt.subplots(figsize=(8, 5))
# Plot the rounded percentages (0-100) so the axis matches its label
ax.bar(porcentaje_nulos_redondeado.index, porcentaje_nulos_redondeado, color='skyblue')
ax.set_title('MISSING VALUES BY COLUMN')
ax.set_ylabel('Percentage (%)')
ax.set_xlabel('Columns with missing values')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
Several columns have missing data, so we next identify the measurement scale of each one, that is, we classify the variables as categorical, continuous, or discrete.
categoricas = [col for col in dataset.columns if(dataset[col].dtypes == 'object')]
categoricas
['Entity', 'Density (P/Km2)']
categoricas_con_na = [col for col in categoricas if dataset[col].isnull().mean() > 0]
dataset[categoricas_con_na].isnull().mean()
Series([], dtype: float64)
continuas = [col for col in dataset.columns if((dataset[col].dtypes in ['int64', 'float64']) and len(dataset[col].unique()) > 30)]
continuas
['Access to electricity (% of population)', 'Access to clean fuels for cooking', 'Renewable-electricity-generating-capacity-per-capita', 'Financial flows to developing countries (US $)', 'Renewable energy share in the total final energy consumption (%)', 'Electricity from fossil fuels (TWh)', 'Electricity from nuclear (TWh)', 'Electricity from renewables (TWh)', 'Low-carbon electricity (% electricity)', 'Primary energy consumption per capita (kWh/person)', 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)', 'Value_co2_emissions_kt_by_country', 'Renewables (% equivalent primary energy)', 'gdp_growth', 'gdp_per_capita', 'Land Area(Km2)', 'Latitude', 'Longitude']
discretas = [col for col in dataset.columns if((dataset[col].dtypes in ['int64', 'float64']) and len(dataset[col].unique()) <= 30)]
discretas
['Year']
discretas_con_na = [col for col in discretas if dataset[col].isnull().mean() > 0]
dataset[discretas_con_na].isnull().mean()
Series([], dtype: float64)
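The three list comprehensions above can be folded into one pass over the columns. This is a sketch under stated assumptions: `split_by_scale` and `max_discrete_levels` are hypothetical names, the toy frame is invented, and `pd.api.types.is_numeric_dtype` is slightly broader than the `int64`/`float64` check used in the notebook.

```python
import numpy as np
import pandas as pd

def split_by_scale(df, max_discrete_levels=30):
    """Classify columns as categorical, continuous, or discrete.

    max_discrete_levels mirrors the cutoff of 30 unique values used above.
    """
    categorical, continuous, discrete = [], [], []
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            categorical.append(col)
        # dropna=False counts NaN as a level, like len(Series.unique())
        elif df[col].nunique(dropna=False) > max_discrete_levels:
            continuous.append(col)
        else:
            discrete.append(col)
    return categorical, continuous, discrete

# Hypothetical toy frame: one text column, one year-like column with 5 levels,
# and one float column with 40 distinct values.
toy = pd.DataFrame({
    "Entity": ["A", "B"] * 20,
    "Year": [2000 + i % 5 for i in range(40)],
    "value": np.linspace(0.0, 1.0, 40),
})
cat, cont, disc = split_by_scale(toy)
print(cat, cont, disc)
```

The cutoff of 30 unique values is a heuristic: a numeric column with few distinct values is treated as discrete regardless of its dtype.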
1.1 Imputation of continuous numeric variables
We compute the fraction of missing values in the continuous numeric variables and select those that have missing values.
continuas_con_na = [col for col in continuas if dataset[col].isnull().mean() > 0]
dataset[continuas_con_na].isnull().mean()
Access to electricity (% of population)                              0.002467
Access to clean fuels for cooking                                    0.046053
Renewable-electricity-generating-capacity-per-capita                 0.255208
Financial flows to developing countries (US $)                       0.572368
Renewable energy share in the total final energy consumption (%)     0.053180
Electricity from fossil fuels (TWh)                                  0.005757
Electricity from nuclear (TWh)                                       0.034539
Electricity from renewables (TWh)                                    0.005757
Low-carbon electricity (% electricity)                               0.011513
Energy intensity level of primary energy (MJ/$2017 PPP GDP)          0.056469
Value_co2_emissions_kt_by_country                                    0.117050
Renewables (% equivalent primary energy)                             0.585526
gdp_growth                                                           0.086623
gdp_per_capita                                                       0.077029
dtype: float64
continuas_con_na = [col for col in continuas if dataset[col].isnull().mean() > 0.06]
dataset[continuas_con_na].isnull().mean()
Renewable-electricity-generating-capacity-per-capita    0.255208
Financial flows to developing countries (US $)          0.572368
Value_co2_emissions_kt_by_country                       0.117050
Renewables (% equivalent primary energy)                0.585526
gdp_growth                                              0.086623
gdp_per_capita                                          0.077029
dtype: float64
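The six variables above each exceed the 6% missing-value threshold used in the filter, while the remaining eight fall below it. Because the share of missing values varies so widely, we analyze each of the 14 variables with missing values individually.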
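The threshold filter can also be written as a split of the missing-value fractions into two groups. A minimal sketch on invented data (the column names and the `toy` frame are hypothetical; 0.06 is the cutoff used in the filter above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 20 rows and three levels of missingness.
toy = pd.DataFrame({
    "complete": np.arange(20, dtype=float),
    "few_missing": [np.nan] + [1.0] * 19,        # 5% missing
    "many_missing": [np.nan] * 10 + [2.0] * 10,  # 50% missing
})

missing_fraction = toy.isnull().mean()
threshold = 0.06

# Columns above the threshold, and columns with some but fewer missing values.
above = missing_fraction[missing_fraction > threshold].index.tolist()
below = missing_fraction[
    (missing_fraction > 0) & (missing_fraction <= threshold)
].index.tolist()
print(above, below)
```

Keeping both groups in separate lists avoids overwriting the full list of columns with missing values, which the per-variable analysis still needs.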
1.1.1 Analysis of variable "Access to electricity (% of population)"
fig = plt.figure(figsize=(5, 3))
dataset['Access to electricity (% of population)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population)')
plt.show()
Mean and median imputation
mean_Access_to_electricity_of_population = round(dataset['Access to electricity (% of population)'].mean(), 2)
temp_series = dataset['Access to electricity (% of population)'].fillna(mean_Access_to_electricity_of_population)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population) - ' + str(mean_Access_to_electricity_of_population))
plt.show()
median_Access_to_electricity_of_population = round(dataset['Access to electricity (% of population)'].median(), 2)
temp_series = dataset['Access to electricity (% of population)'].fillna(median_Access_to_electricity_of_population)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to electricity (% of population) - ' + str(median_Access_to_electricity_of_population))
plt.show()
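We use mean imputation.
# Assign the result back; inplace=True on a column selection triggers warnings in recent pandas versions
dataset['Access to electricity (% of population)'] = dataset['Access to electricity (% of population)'].fillna(mean_Access_to_electricity_of_population)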
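The histograms compare the two fills visually. A numeric summary can complement them. The sketch below uses a hypothetical helper, `compare_imputations`, on invented values. Unlike the notebook, it does not round the fill value to two decimals.

```python
import numpy as np
import pandas as pd

def compare_imputations(series):
    """Return mean, median and std of a series before and after each fill."""
    candidates = {
        "original": series,
        "mean_filled": series.fillna(series.mean()),
        "median_filled": series.fillna(series.median()),
    }
    return pd.DataFrame(
        {name: [s.mean(), s.median(), s.std()] for name, s in candidates.items()},
        index=["mean", "median", "std"],
    )

# Invented values with one large observation and two missing entries.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan, np.nan])
summary = compare_imputations(toy)
print(summary)
```

Mean filling preserves the mean and median filling preserves the median, but both reduce the standard deviation, because every filled entry sits at a single central value.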
1.1.2 Analysis of variable "Access to clean fuels for cooking"
fig = plt.figure(figsize=(5, 3))
dataset['Access to clean fuels for cooking'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking')
plt.show()
Mean and median imputation
mean_Access_to_clean_fuels_for_cooking = round(dataset['Access to clean fuels for cooking'].mean(), 2)
temp_series = dataset['Access to clean fuels for cooking'].fillna(mean_Access_to_clean_fuels_for_cooking)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking - ' + str(mean_Access_to_clean_fuels_for_cooking))
plt.show()
median_Access_to_clean_fuels_for_cooking = round(dataset['Access to clean fuels for cooking'].median(), 2)
temp_series = dataset['Access to clean fuels for cooking'].fillna(median_Access_to_clean_fuels_for_cooking)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Access to clean fuels for cooking - ' + str(median_Access_to_clean_fuels_for_cooking))
plt.show()
We use mean imputation.
dataset['Access to clean fuels for cooking'] = dataset['Access to clean fuels for cooking'].fillna(mean_Access_to_clean_fuels_for_cooking)
1.1.3 Analysis of variable "Renewable-electricity-generating-capacity-per-capita"
fig = plt.figure(figsize=(5, 3))
dataset['Renewable-electricity-generating-capacity-per-capita'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita')
plt.show()
Mean and median imputation
mean_Renewable_electricity_generating_capacity_per_capita = round(dataset['Renewable-electricity-generating-capacity-per-capita'].mean(), 2)
temp_series = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(mean_Renewable_electricity_generating_capacity_per_capita)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita - ' + str(mean_Renewable_electricity_generating_capacity_per_capita))
plt.show()
median_Renewable_electricity_generating_capacity_per_capita = round(dataset['Renewable-electricity-generating-capacity-per-capita'].median(), 2)
temp_series = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(median_Renewable_electricity_generating_capacity_per_capita)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable-electricity-generating-capacity-per-capita - ' + str(median_Renewable_electricity_generating_capacity_per_capita))
plt.show()
We use median imputation.
dataset['Renewable-electricity-generating-capacity-per-capita'] = dataset['Renewable-electricity-generating-capacity-per-capita'].fillna(median_Renewable_electricity_generating_capacity_per_capita)
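The notebook does not state why the median is chosen here. A common rule of thumb is that the median is less sensitive to extreme values in skewed distributions. The sketch below shows how sample skewness could inform the choice. Both `suggest_fill_statistic` and the cutoff of 1.0 are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def suggest_fill_statistic(series, skew_threshold=1.0):
    """Suggest 'median' for strongly skewed data, 'mean' otherwise.

    skew_threshold is an illustrative cutoff, not a value from the notebook.
    """
    skewness = series.skew()  # sample skewness; NaN values are ignored
    return "median" if abs(skewness) > skew_threshold else "mean"

# Invented examples: one symmetric series and one with a large outlier.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 500.0, np.nan])

print(suggest_fill_statistic(symmetric))
print(suggest_fill_statistic(right_skewed))
```

A rule like this is only a starting point; the histograms remain the main evidence for each decision.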
1.1.4 Analysis of variable "Financial flows to developing countries (US $)"
fig = plt.figure(figsize=(5, 3))
dataset['Financial flows to developing countries (US $)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $)')
plt.show()
Mean and median imputation
mean_Financial_flows_to_developing_countries = round(dataset['Financial flows to developing countries (US $)'].mean(), 2)
temp_series = dataset['Financial flows to developing countries (US $)'].fillna(mean_Financial_flows_to_developing_countries)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $) - ' + str(mean_Financial_flows_to_developing_countries))
plt.show()
median_Financial_flows_to_developing_countries = round(dataset['Financial flows to developing countries (US $)'].median(), 2)
temp_series = dataset['Financial flows to developing countries (US $)'].fillna(median_Financial_flows_to_developing_countries)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Financial flows to developing countries (US $) - ' + str(median_Financial_flows_to_developing_countries))
plt.show()
We use median imputation.
dataset['Financial flows to developing countries (US $)'] = dataset['Financial flows to developing countries (US $)'].fillna(median_Financial_flows_to_developing_countries)
1.1.5 Analysis of variable "Renewable energy share in the total final energy consumption (%)"
fig = plt.figure(figsize=(5, 3))
dataset['Renewable energy share in the total final energy consumption (%)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%)')
plt.show()
Mean and median imputation
mean_Renewable_energy_share_in_the_total_final_energy_consumption = round(dataset['Renewable energy share in the total final energy consumption (%)'].mean(), 2)
temp_series = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(mean_Renewable_energy_share_in_the_total_final_energy_consumption)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%) - ' + str(mean_Renewable_energy_share_in_the_total_final_energy_consumption))
plt.show()
median_Renewable_energy_share_in_the_total_final_energy_consumption = round(dataset['Renewable energy share in the total final energy consumption (%)'].median(), 2)
temp_series = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(median_Renewable_energy_share_in_the_total_final_energy_consumption)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewable energy share in the total final energy consumption (%) - ' + str(median_Renewable_energy_share_in_the_total_final_energy_consumption))
plt.show()
We use median imputation.
dataset['Renewable energy share in the total final energy consumption (%)'] = dataset['Renewable energy share in the total final energy consumption (%)'].fillna(median_Renewable_energy_share_in_the_total_final_energy_consumption)
1.1.6 Analysis of variable "Electricity from fossil fuels (TWh)"
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from fossil fuels (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh)')
plt.show()
Mean and median imputation
mean_Electricity_from_fossil_fuels = round(dataset['Electricity from fossil fuels (TWh)'].mean(), 2)
temp_series = dataset['Electricity from fossil fuels (TWh)'].fillna(mean_Electricity_from_fossil_fuels)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh) - ' + str(mean_Electricity_from_fossil_fuels))
plt.show()
median_Electricity_from_fossil_fuels = round(dataset['Electricity from fossil fuels (TWh)'].median(), 2)
temp_series = dataset['Electricity from fossil fuels (TWh)'].fillna(median_Electricity_from_fossil_fuels)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from fossil fuels (TWh) - ' + str(median_Electricity_from_fossil_fuels))
plt.show()
We use median imputation.
dataset['Electricity from fossil fuels (TWh)'] = dataset['Electricity from fossil fuels (TWh)'].fillna(median_Electricity_from_fossil_fuels)
1.1.7 Analysis of variable "Electricity from nuclear (TWh)"
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from nuclear (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh)')
plt.show()
Mean and median imputation
mean_Electricity_from_nuclear = round(dataset['Electricity from nuclear (TWh)'].mean(), 2)
temp_series = dataset['Electricity from nuclear (TWh)'].fillna(mean_Electricity_from_nuclear)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh) - ' + str(mean_Electricity_from_nuclear))
plt.show()
median_Electricity_from_nuclear = round(dataset['Electricity from nuclear (TWh)'].median(), 2)
temp_series = dataset['Electricity from nuclear (TWh)'].fillna(median_Electricity_from_nuclear)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from nuclear (TWh) - ' + str(median_Electricity_from_nuclear))
plt.show()
We use median imputation.
dataset['Electricity from nuclear (TWh)'] = dataset['Electricity from nuclear (TWh)'].fillna(median_Electricity_from_nuclear)
1.1.8 Analysis of variable "Electricity from renewables (TWh)"
fig = plt.figure(figsize=(5, 3))
dataset['Electricity from renewables (TWh)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh)')
plt.show()
Mean and median imputation
mean_Electricity_from_renewables = round(dataset['Electricity from renewables (TWh)'].mean(), 2)
temp_series = dataset['Electricity from renewables (TWh)'].fillna(mean_Electricity_from_renewables)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh) - ' + str(mean_Electricity_from_renewables))
plt.show()
median_Electricity_from_renewables = round(dataset['Electricity from renewables (TWh)'].median(), 2)
temp_series = dataset['Electricity from renewables (TWh)'].fillna(median_Electricity_from_renewables)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Electricity from renewables (TWh) - ' + str(median_Electricity_from_renewables))
plt.show()
We use median imputation.
dataset['Electricity from renewables (TWh)'] = dataset['Electricity from renewables (TWh)'].fillna(median_Electricity_from_renewables)
1.1.9 Analysis of variable "Low-carbon electricity (% electricity)"
fig = plt.figure(figsize=(5, 3))
dataset['Low-carbon electricity (% electricity)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity)')
plt.show()
Mean and median imputation
mean_Low_carbon_electricity = round(dataset['Low-carbon electricity (% electricity)'].mean(), 2)
temp_series = dataset['Low-carbon electricity (% electricity)'].fillna(mean_Low_carbon_electricity)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity) - ' + str(mean_Low_carbon_electricity))
plt.show()
median_Low_carbon_electricity = round(dataset['Low-carbon electricity (% electricity)'].median(), 2)
temp_series = dataset['Low-carbon electricity (% electricity)'].fillna(median_Low_carbon_electricity)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Low-carbon electricity (% electricity) - ' + str(median_Low_carbon_electricity))
plt.show()
We use median imputation.
dataset['Low-carbon electricity (% electricity)'] = dataset['Low-carbon electricity (% electricity)'].fillna(median_Low_carbon_electricity)
1.1.10 Analysis of variable "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"
fig = plt.figure(figsize=(5, 3))
dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP)')
plt.show()
Mean and median imputation
mean_Energy_intensity_level_of_primary_energy = round(dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].mean(), 2)
temp_series = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(mean_Energy_intensity_level_of_primary_energy)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP) - ' + str(mean_Energy_intensity_level_of_primary_energy))
plt.show()
median_Energy_intensity_level_of_primary_energy = round(dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].median(), 2)
temp_series = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(median_Energy_intensity_level_of_primary_energy)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Energy intensity level of primary energy (MJ/$2017 PPP GDP) - ' + str(median_Energy_intensity_level_of_primary_energy))
plt.show()
We use median imputation.
dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'] = dataset['Energy intensity level of primary energy (MJ/$2017 PPP GDP)'].fillna(median_Energy_intensity_level_of_primary_energy)
1.1.11 Analysis of variable "Value_co2_emissions_kt_by_country"
fig = plt.figure(figsize=(5, 3))
dataset['Value_co2_emissions_kt_by_country'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country')
plt.show()
Mean and median imputation
mean_Value_co2_emissions_kt_by_country = round(dataset['Value_co2_emissions_kt_by_country'].mean(), 2)
temp_series = dataset['Value_co2_emissions_kt_by_country'].fillna(mean_Value_co2_emissions_kt_by_country)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country - ' + str(mean_Value_co2_emissions_kt_by_country))
plt.show()
median_Value_co2_emissions_kt_by_country = round(dataset['Value_co2_emissions_kt_by_country'].median(), 2)
temp_series = dataset['Value_co2_emissions_kt_by_country'].fillna(median_Value_co2_emissions_kt_by_country)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Value_co2_emissions_kt_by_country - ' + str(median_Value_co2_emissions_kt_by_country))
plt.show()
We use median imputation.
dataset['Value_co2_emissions_kt_by_country'] = dataset['Value_co2_emissions_kt_by_country'].fillna(median_Value_co2_emissions_kt_by_country)
1.1.12 Analysis of variable "Renewables (% equivalent primary energy)"
fig = plt.figure(figsize=(5, 3))
dataset['Renewables (% equivalent primary energy)'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy)')
plt.show()
Mean and median imputation
mean_Renewables_equivalent_primary_energy = round(dataset['Renewables (% equivalent primary energy)'].mean(), 2)
temp_series = dataset['Renewables (% equivalent primary energy)'].fillna(mean_Renewables_equivalent_primary_energy)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy) - ' + str(mean_Renewables_equivalent_primary_energy))
plt.show()
median_Renewables_equivalent_primary_energy = round(dataset['Renewables (% equivalent primary energy)'].median(), 2)
temp_series = dataset['Renewables (% equivalent primary energy)'].fillna(median_Renewables_equivalent_primary_energy)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('Renewables (% equivalent primary energy) - ' + str(median_Renewables_equivalent_primary_energy))
plt.show()
We use median imputation.
dataset['Renewables (% equivalent primary energy)'] = dataset['Renewables (% equivalent primary energy)'].fillna(median_Renewables_equivalent_primary_energy)
1.1.13 Analysis of variable "gdp_growth"
fig = plt.figure(figsize=(5, 3))
dataset['gdp_growth'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth')
plt.show()
mean_gdp_growth = round(dataset['gdp_growth'].mean(), 2)
temp_series = dataset['gdp_growth'].fillna(mean_gdp_growth)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth - ' + str(mean_gdp_growth))
plt.show()
median_gdp_growth = round(dataset['gdp_growth'].median(), 2)
temp_series = dataset['gdp_growth'].fillna(median_gdp_growth)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_growth - ' + str(median_gdp_growth))
plt.show()
We use mean imputation.
dataset['gdp_growth'] = dataset['gdp_growth'].fillna(mean_gdp_growth)
1.1.14 Analysis of variable "gdp_per_capita"
fig = plt.figure(figsize=(5, 3))
dataset['gdp_per_capita'].hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita')
plt.show()
mean_gdp_per_capita = round(dataset['gdp_per_capita'].mean(), 2)
temp_series = dataset['gdp_per_capita'].fillna(mean_gdp_per_capita)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita - ' + str(mean_gdp_per_capita))
plt.show()
median_gdp_per_capita = round(dataset['gdp_per_capita'].median(), 2)
temp_series = dataset['gdp_per_capita'].fillna(median_gdp_per_capita)
fig = plt.figure(figsize=(5, 3))
temp_series.hist(bins=20, density=True, color='red', alpha=0.3)
plt.title('gdp_per_capita - ' + str(median_gdp_per_capita))
plt.show()
We use median imputation.
dataset['gdp_per_capita'] = dataset['gdp_per_capita'].fillna(median_gdp_per_capita)
1.2 Imputation of discrete and categorical variables with missing values
No imputation is performed on the discrete and categorical variables, since they have no missing values.
Finally, we check the fraction of missing values in every column again to make sure all of them have been handled.
pd.DataFrame(dataset.isnull().mean()).transpose()
| | Entity | Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | ... | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Density (P/Km2) | Land Area(Km2) | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 rows × 21 columns
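Reading a wide table of zeros by eye is error-prone, so the check can also be automated. A minimal sketch, where `assert_no_missing` is a hypothetical helper and both frames are invented:

```python
import numpy as np
import pandas as pd

def assert_no_missing(df):
    """Raise AssertionError naming the columns that still contain NaN."""
    remaining = df.columns[df.isnull().any()].tolist()
    assert not remaining, f"Columns still containing missing values: {remaining}"

clean = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
assert_no_missing(clean)  # passes silently

dirty = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
try:
    assert_no_missing(dirty)
    failed = False
except AssertionError as err:
    failed = True
    print(err)
```

Calling `assert_no_missing(dataset)` before writing the file would stop the notebook if any column had been skipped.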
Writing the dataset to disk
dataset.to_csv("fase_1_proy.csv", index=False)
dataset_proy = pd.read_csv("fase_1_proy.csv")
dataset_proy.head()
| | Entity | Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | ... | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Density (P/Km2) | Land Area(Km2) | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2000 | 1.613591 | 6.2 | 9.22 | 20000.0 | 44.99 | 0.16 | 0.0 | 0.31 | ... | 302.59482 | 1.64 | 760.000000 | 6.29 | 3.440000 | 4578.630000 | 60 | 652230 | 33.93911 | 67.709953 |
| 1 | Afghanistan | 2001 | 4.074574 | 7.2 | 8.86 | 130000.0 | 45.60 | 0.09 | 0.0 | 0.50 | ... | 236.89185 | 1.74 | 730.000000 | 6.29 | 3.440000 | 4578.630000 | 60 | 652230 | 33.93911 | 67.709953 |
| 2 | Afghanistan | 2002 | 9.409158 | 8.2 | 8.47 | 3950000.0 | 37.83 | 0.13 | 0.0 | 0.56 | ... | 210.86215 | 1.40 | 1029.999971 | 6.29 | 3.440000 | 179.426579 | 60 | 652230 | 33.93911 | 67.709953 |
| 3 | Afghanistan | 2003 | 14.738506 | 9.5 | 8.09 | 25970000.0 | 36.66 | 0.31 | 0.0 | 0.63 | ... | 229.96822 | 1.40 | 1220.000029 | 6.29 | 8.832278 | 190.683814 | 60 | 652230 | 33.93911 | 67.709953 |
| 4 | Afghanistan | 2004 | 20.064968 | 10.9 | 7.75 | 5665000.0 | 44.24 | 0.33 | 0.0 | 0.56 | ... | 204.23125 | 1.20 | 1029.999971 | 6.29 | 1.414118 | 211.382074 | 60 | 652230 | 33.93911 | 67.709953 |
5 rows × 21 columns
def get_variables_scale(dataset):
    continuas = [col for col in dataset.columns if dataset[col].dtype in ['float64','int64'] and len(dataset[col].unique())>30]
    discretas = [col for col in dataset.columns if dataset[col].dtype in ['float64','int64'] and len(dataset[col].unique())<=30]
    return continuas, discretas
cont, disct = get_variables_scale(dataset_proy)
cont
['Access to electricity (% of population)', 'Access to clean fuels for cooking', 'Renewable-electricity-generating-capacity-per-capita', 'Financial flows to developing countries (US $)', 'Renewable energy share in the total final energy consumption (%)', 'Electricity from fossil fuels (TWh)', 'Electricity from nuclear (TWh)', 'Electricity from renewables (TWh)', 'Low-carbon electricity (% electricity)', 'Primary energy consumption per capita (kWh/person)', 'Energy intensity level of primary energy (MJ/$2017 PPP GDP)', 'Value_co2_emissions_kt_by_country', 'Renewables (% equivalent primary energy)', 'gdp_growth', 'gdp_per_capita', 'Land Area(Km2)', 'Latitude', 'Longitude']
# Create a DataFrame for the continuous variables
proy_cont = pd.DataFrame(dataset_proy)
cont, disct = get_variables_scale(proy_cont)
df_continuas = proy_cont[cont]
df_continuas.head()
| Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Low-carbon electricity (% electricity) | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Land Area(Km2) | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.613591 | 6.2 | 9.22 | 20000.0 | 44.99 | 0.16 | 0.0 | 0.31 | 65.957440 | 302.59482 | 1.64 | 760.000000 | 6.29 | 3.440000 | 4578.630000 | 652230 | 33.93911 | 67.709953 |
| 1 | 4.074574 | 7.2 | 8.86 | 130000.0 | 45.60 | 0.09 | 0.0 | 0.50 | 84.745766 | 236.89185 | 1.74 | 730.000000 | 6.29 | 3.440000 | 4578.630000 | 652230 | 33.93911 | 67.709953 |
| 2 | 9.409158 | 8.2 | 8.47 | 3950000.0 | 37.83 | 0.13 | 0.0 | 0.56 | 81.159424 | 210.86215 | 1.40 | 1029.999971 | 6.29 | 3.440000 | 179.426579 | 652230 | 33.93911 | 67.709953 |
| 3 | 14.738506 | 9.5 | 8.09 | 25970000.0 | 36.66 | 0.31 | 0.0 | 0.63 | 67.021280 | 229.96822 | 1.40 | 1220.000029 | 6.29 | 8.832278 | 190.683814 | 652230 | 33.93911 | 67.709953 |
| 4 | 20.064968 | 10.9 | 7.75 | 5665000.0 | 44.24 | 0.33 | 0.0 | 0.56 | 62.921350 | 204.23125 | 1.20 | 1029.999971 | 6.29 | 1.414118 | 211.382074 | 652230 | 33.93911 | 67.709953 |
# Function to plot the histogram, Q-Q plot, and boxplot of a column
def plot_outliers_analysis(dataset, col):
    plt.figure(figsize=(10,2))
    print(col)
    plt.subplot(131)
    dataset[col].hist(bins=50, density=True, color='red')
    plt.title("Density histogram")
    plt.subplot(132)
    stats.probplot(dataset[col], dist = "norm", plot=plt)
    plt.title("QQ-Plot")
    plt.subplot(133)
    sns.boxplot(y=dataset[col])
    plt.title("Boxplot")
    plt.show()
for col in cont:
    plot_outliers_analysis(proy_cont, col)
Access to electricity (% of population)
Access to clean fuels for cooking
Renewable-electricity-generating-capacity-per-capita
Financial flows to developing countries (US $)
Renewable energy share in the total final energy consumption (%)
Electricity from fossil fuels (TWh)
Electricity from nuclear (TWh)
Electricity from renewables (TWh)
Low-carbon electricity (% electricity)
Primary energy consumption per capita (kWh/person)
Energy intensity level of primary energy (MJ/$2017 PPP GDP)
Value_co2_emissions_kt_by_country
Renewables (% equivalent primary energy)
gdp_growth
gdp_per_capita
Land Area(Km2)
Latitude
Longitude
# Function for outlier detection: lower and upper limits by the 1.5 * IQR rule
def get_outliers_limits(dataset, col):
    IQR = dataset[col].quantile(0.75) - dataset[col].quantile(0.25)
    LI = dataset[col].quantile(0.25) - (1.5 * IQR)
    LS = dataset[col].quantile(0.75) + (1.5 * IQR)
    return LI, LS
get_outliers_limits(df_continuas, "Longitude")
(-98.7491465, 133.1688735)
# Create a new DataFrame to store the variables after capping their outliers
capped_df = pd.DataFrame()
for col in df_continuas.columns:
    LI, LS = get_outliers_limits(df_continuas, col)
    capped_df[col] = np.where(df_continuas[col] > LS, LS,
                              np.where(df_continuas[col] < LI, LI,
                                       df_continuas[col]))
capped_df.head()
| Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Low-carbon electricity (% electricity) | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Land Area(Km2) | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.613591 | 6.2 | 9.22 | 5665000.0 | 44.99 | 0.16 | 0.0 | 0.31 | 65.957440 | 302.59482 | 1.64 | 760.000000 | 6.29 | 3.440000 | 4578.630000 | 652230.0 | 33.93911 | 67.709953 |
| 1 | 4.074574 | 7.2 | 8.86 | 5665000.0 | 45.60 | 0.09 | 0.0 | 0.50 | 84.745766 | 236.89185 | 1.74 | 730.000000 | 6.29 | 3.440000 | 4578.630000 | 652230.0 | 33.93911 | 67.709953 |
| 2 | 9.409158 | 8.2 | 8.47 | 5665000.0 | 37.83 | 0.13 | 0.0 | 0.56 | 81.159424 | 210.86215 | 1.40 | 1029.999971 | 6.29 | 3.440000 | 179.426579 | 652230.0 | 33.93911 | 67.709953 |
| 3 | 14.738506 | 9.5 | 8.09 | 5665000.0 | 36.66 | 0.31 | 0.0 | 0.63 | 67.021280 | 229.96822 | 1.40 | 1220.000029 | 6.29 | 8.832278 | 190.683814 | 652230.0 | 33.93911 | 67.709953 |
| 4 | 20.064968 | 10.9 | 7.75 | 5665000.0 | 44.24 | 0.33 | 0.0 | 0.56 | 62.921350 | 204.23125 | 1.20 | 1029.999971 | 6.29 | 1.414118 | 211.382074 | 652230.0 | 33.93911 | 67.709953 |
get_outliers_limits(capped_df, "Longitude")
(-98.7491465, 133.1688735)
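The capping step can also be written with `Series.clip`, which avoids the nested `np.where`. A minimal sketch on toy data; the helper names `iqr_limits` and `cap_outliers` and the parameter `k` are assumptions introduced here for illustration:

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Lower and upper limits (Q1 - k*IQR, Q3 + k*IQR) for one column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(df, columns, k=1.5):
    """Return a copy of df[columns] with every value clipped to its column's limits."""
    capped = df[columns].copy()
    for c in columns:
        low, high = iqr_limits(capped[c], k)
        capped[c] = capped[c].clip(lower=low, upper=high)
    return capped

# Toy column with one extreme value (assumption, not the project data).
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
print(iqr_limits(toy["x"]))
print(cap_outliers(toy, ["x"]))
```

One caveat that the tables above make visible: when the first and third quartiles are equal the IQR is 0, both limits collapse to the same number, and the capped column becomes constant. This is what happens to `Financial flows to developing countries (US $)`, where every displayed value becomes 5665000.0.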
# Plot all the variables of our new DataFrame
for col in cont:
    plot_outliers_analysis(capped_df, col)
Access to electricity (% of population)
Access to clean fuels for cooking
Renewable-electricity-generating-capacity-per-capita
Financial flows to developing countries (US $)
Renewable energy share in the total final energy consumption (%)
Electricity from fossil fuels (TWh)
Electricity from nuclear (TWh)
Electricity from renewables (TWh)
Low-carbon electricity (% electricity)
Primary energy consumption per capita (kWh/person)
Energy intensity level of primary energy (MJ/$2017 PPP GDP)
Value_co2_emissions_kt_by_country
Renewables (% equivalent primary energy)
gdp_growth
gdp_per_capita
Land Area(Km2)
Latitude
Longitude
3. Next, for the variables that received outlier treatment, check the shape of the distribution and determine whether some kind of variable transformation is needed to improve it.¶
If so, apply the transformation you consider appropriate in order to bring the probability distribution of each variable as close to normal as possible and improve model performance. Remember that you can apply the following transformations:¶
a. Logarithmic,¶
b. Exponential,¶
c. Polynomial,¶
d. Box-Cox,¶
e. Yeo-Johnson¶
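One way to complement a visual check of these transformations is to score each candidate by the correlation coefficient of its normal Q-Q plot, which `scipy.stats.probplot` returns alongside the fitted line. A rough sketch, assuming synthetic log-normal data; `qq_r` and `compare_transforms` are hypothetical helper names, not part of the notebook:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def qq_r(values):
    """Correlation coefficient of the normal Q-Q plot (closer to 1 means closer to normal)."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

def compare_transforms(series):
    """Score the raw column and several candidate transformations."""
    x = series.to_numpy(dtype=float)
    scores = {"raw": qq_r(x), "square": qq_r(x ** 2)}
    scores["yeo_johnson"] = qq_r(stats.yeojohnson(x)[0])
    if (x > 0).all():  # log, inverse and Box-Cox need strictly positive data
        scores["log"] = qq_r(np.log(x))
        scores["inverse"] = qq_r(1.0 / x)
        scores["box_cox"] = qq_r(stats.boxcox(x)[0])
    return pd.Series(scores).sort_values(ascending=False)

# Synthetic right-skewed sample (assumption, not the project data).
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))
print(compare_transforms(toy))
```

The score is only a summary of the Q-Q plot, so it is best read together with the plots rather than instead of them.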
Function to plot the density
def plot_density_qq(df,variable):
    plt.figure(figsize=(7,4))
    plt.subplot(121)
    df[variable].hist(bins=30)
    plt.title(variable)
    plt.subplot(122)
    stats.probplot(df[variable], dist= 'norm', plot=plt)
    plt.show()
1. For "Access to electricity (% of population)"¶
col = "Access to electricity (% of population)"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Box-Cox transformation
capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
1.9652
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
2.029
2. For "Access to clean fuels for cooking"¶
col = "Access to clean fuels for cooking"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.8097
3. For "Renewable-electricity-generating-capacity-per-capita"¶
col = "Renewable-electricity-generating-capacity-per-capita"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.2644
4. For "Financial flows to developing countries (US $)"¶
col = "Financial flows to developing countries (US $)"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.1967
5. For "Renewable energy share in the total final energy consumption (%)"¶
col = "Renewable energy share in the total final energy consumption (%)"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.307
6. For "Electricity from fossil fuels (TWh)"¶
col = "Electricity from fossil fuels (TWh)"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
-0.1922
7. For "Electricity from nuclear (TWh)"¶
col = "Electricity from nuclear (TWh)"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
8. For "Electricity from renewables (TWh)"¶
col = "Electricity from renewables (TWh)"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
-0.3133
9. For "Low-carbon electricity (% electricity)"¶
col = "Low-carbon electricity (% electricity)"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.3055
10. For "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"¶
col = "Energy intensity level of primary energy (MJ/$2017 PPP GDP)"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Box-Cox transformation
capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.3158
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.0172
11. For "Value_co2_emissions_kt_by_country"¶
col = "Value_co2_emissions_kt_by_country"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Box-Cox transformation
capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.1401
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.1394
12. For "Renewables (% equivalent primary energy)"¶
col = "Renewables (% equivalent primary energy)"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
4.5589
13. For "gdp_growth"¶
col = "gdp_growth"
plot_density_qq(capped_df, col)
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
1.0638
14. For "gdp_per_capita"¶
col = "gdp_per_capita"
plot_density_qq(capped_df, col)
# Logarithmic transformation
capped_df[col + '_log'] = np.log(capped_df[col])
plot_density_qq(capped_df, col + '_log')
# Inverse transformation
capped_df[col + '_inv'] = 1/(capped_df[col])
plot_density_qq(capped_df, col + '_inv')
# Second-order polynomial transformation
capped_df[col + '_cuand'] = (capped_df[col])**2
plot_density_qq(capped_df, col + '_cuand')
# Box-Cox transformation
capped_df[col + '_BC'], lmbd = stats.boxcox(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_BC')
0.0841
# Yeo-Johnson transformation
capped_df[col + '_YJ'], lmbd = stats.yeojohnson(capped_df[col])
lmbd = round(lmbd,4)
print(lmbd)
plot_density_qq(capped_df, col + '_YJ')
0.0838
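In the outputs above, the Box-Cox and Yeo-Johnson lambdas nearly coincide for `Value_co2_emissions_kt_by_country` (0.1401 vs 0.1394) and `gdp_per_capita` (0.0841 vs 0.0838), but not for the energy-intensity column (0.3158 vs 0.0172). A likely explanation is that, for non-negative data, Yeo-Johnson is Box-Cox applied to x + 1, so the two agree when the values are large relative to 1 and can diverge when they are small. A minimal sketch on synthetic data; the log-normal sample `x` is an assumption, not the notebook's data:

```python
import numpy as np
import scipy.stats as stats

# Large, strictly positive toy values (assumption).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=8.0, sigma=1.5, size=1000)

_, lmbda_bc = stats.boxcox(x)                 # Box-Cox on x
_, lmbda_bc_shifted = stats.boxcox(x + 1.0)   # Box-Cox on x + 1
_, lmbda_yj = stats.yeojohnson(x)             # Yeo-Johnson on x

# The last two estimates should agree up to optimizer tolerance;
# the first is close here only because x is much larger than 1.
print(round(lmbda_bc, 4), round(lmbda_bc_shifted, 4), round(lmbda_yj, 4))
```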
5. Finally, once all the transformations described above have been applied, apply feature scaling to the whole dataset. Remember that you can apply the following types of feature scaling.¶
dataset_proy.describe()
| Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable-electricity-generating-capacity-per-capita | Financial flows to developing countries (US $) | Renewable energy share in the total final energy consumption (%) | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Low-carbon electricity (% electricity) | Primary energy consumption per capita (kWh/person) | Energy intensity level of primary energy (MJ/$2017 PPP GDP) | Value_co2_emissions_kt_by_country | Renewables (% equivalent primary energy) | gdp_growth | gdp_per_capita | Land Area(Km2) | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3648.000000 | 3648.000000 | 3648.000000 | 3648.000000 | 3.648000e+03 | 3648.000000 | 3648.000000 | 3648.000000 | 3648.000000 | 3648.000000 | 3648.000000 | 3648.000000 | 3.648000e+03 | 3648.000000 | 3648.000000 | 3648.000000 | 3.648000e+03 | 3648.000000 | 3648.000000 |
| mean | 2010.041118 | 78.933693 | 63.255504 | 92.493616 | 4.353562e+07 | 32.142379 | 69.996209 | 12.989315 | 23.845069 | 36.708904 | 25747.285360 | 5.250461 | 1.423831e+05 | 8.651135 | 3.441471 | 12613.230060 | 6.332135e+05 | 18.246388 | 14.822695 |
| std | 6.052776 | 30.238162 | 38.133777 | 213.395777 | 1.998022e+08 | 29.168672 | 347.131749 | 71.786293 | 104.157507 | 34.129226 | 34777.415694 | 3.438690 | 7.285451e+05 | 10.051457 | 5.434772 | 19077.099547 | 1.585519e+06 | 24.159232 | 66.348148 |
| min | 2000.000000 | 1.252269 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.110000 | 1.000000e+01 | 0.000000 | -62.075920 | 111.927225 | 2.100000e+01 | -40.900557 | -175.198242 |
| 25% | 2005.000000 | 59.916941 | 25.862500 | 8.380000 | 5.665000e+06 | 7.095000 | 0.300000 | 0.000000 | 0.050000 | 3.030303 | 3116.636825 | 3.220000 | 2.509557e+03 | 6.290000 | 1.651476 | 1464.841885 | 2.571300e+04 | 3.202778 | -11.779889 |
| 50% | 2010.000000 | 98.272340 | 78.875000 | 32.880000 | 5.665000e+06 | 23.270000 | 2.970000 | 0.000000 | 1.470000 | 27.910000 | 13118.841000 | 4.300000 | 1.050000e+04 | 6.290000 | 3.440000 | 4578.630000 | 1.176000e+05 | 17.189877 | 19.145136 |
| 75% | 2015.000000 | 100.000000 | 100.000000 | 67.472500 | 5.665000e+06 | 52.612500 | 26.527500 | 0.000000 | 9.560000 | 64.038130 | 33897.402500 | 5.880000 | 5.136250e+04 | 6.290000 | 5.543696 | 13993.509465 | 5.131200e+05 | 38.969719 | 46.199616 |
| max | 2020.000000 | 100.000000 | 100.000000 | 3060.190000 | 5.202310e+09 | 96.040000 | 5184.130000 | 809.410000 | 2184.940000 | 100.000010 | 262585.700000 | 32.570000 | 1.070722e+07 | 86.836586 | 123.139555 | 123514.196700 | 9.984670e+06 | 64.963051 | 178.065032 |
# Function to apply min-max scaling to the numeric columns of our DataFrame
def min_max_scale(df):
    scaled_df = pd.DataFrame()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            min_val = df[col].min()
            max_val = df[col].max()
            scaled_df[col + '_minMaxScaled'] = (df[col] - min_val) / (max_val - min_val)
        else:
            scaled_df[col] = df[col]
    return scaled_df
df_continuas = pd.read_csv("fase_1_proy.csv")
scaled_df = min_max_scale(df_continuas)
# Show the first rows of the scaled DataFrame
scaled_df.head()
| Entity | Year_minMaxScaled | Access to electricity (% of population)_minMaxScaled | Access to clean fuels for cooking_minMaxScaled | Renewable-electricity-generating-capacity-per-capita_minMaxScaled | Financial flows to developing countries (US $)_minMaxScaled | Renewable energy share in the total final energy consumption (%)_minMaxScaled | Electricity from fossil fuels (TWh)_minMaxScaled | Electricity from nuclear (TWh)_minMaxScaled | Electricity from renewables (TWh)_minMaxScaled | ... | Primary energy consumption per capita (kWh/person)_minMaxScaled | Energy intensity level of primary energy (MJ/$2017 PPP GDP)_minMaxScaled | Value_co2_emissions_kt_by_country_minMaxScaled | Renewables (% equivalent primary energy)_minMaxScaled | gdp_growth_minMaxScaled | gdp_per_capita_minMaxScaled | Density (P/Km2) | Land Area(Km2)_minMaxScaled | Latitude_minMaxScaled | Longitude_minMaxScaled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.00 | 0.003659 | 0.062 | 0.003013 | 0.000004 | 0.468451 | 0.000031 | 0.0 | 0.000142 | ... | 0.001152 | 0.047135 | 0.000070 | 0.072435 | 0.353728 | 0.036196 | 60 | 0.065321 | 0.706944 | 0.687612 |
| 1 | Afghanistan | 0.05 | 0.028581 | 0.072 | 0.002895 | 0.000025 | 0.474802 | 0.000017 | 0.0 | 0.000229 | ... | 0.000902 | 0.050216 | 0.000067 | 0.072435 | 0.353728 | 0.036196 | 60 | 0.065321 | 0.706944 | 0.687612 |
| 2 | Afghanistan | 0.10 | 0.082603 | 0.082 | 0.002768 | 0.000759 | 0.393898 | 0.000025 | 0.0 | 0.000256 | ... | 0.000803 | 0.039741 | 0.000095 | 0.072435 | 0.353728 | 0.000547 | 60 | 0.065321 | 0.706944 | 0.687612 |
| 3 | Afghanistan | 0.15 | 0.136573 | 0.095 | 0.002644 | 0.004992 | 0.381716 | 0.000060 | 0.0 | 0.000288 | ... | 0.000876 | 0.039741 | 0.000113 | 0.072435 | 0.382842 | 0.000638 | 60 | 0.065321 | 0.706944 | 0.687612 |
| 4 | Afghanistan | 0.20 | 0.190513 | 0.109 | 0.002533 | 0.001089 | 0.460641 | 0.000064 | 0.0 | 0.000256 | ... | 0.000778 | 0.033580 | 0.000095 | 0.072435 | 0.342790 | 0.000806 | 60 | 0.065321 | 0.706944 | 0.687612 |
5 rows × 21 columns
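Min-max scaling maps each numeric column to [0, 1] through (x - min) / (max - min), which is undefined when a column is constant because max - min is 0. A vectorized sketch that guards against that case; the name `min_max_scale_safe` and the choice of mapping constant columns to 0.0 are assumptions made here for illustration:

```python
import pandas as pd

def min_max_scale_safe(df):
    """Min-max scale numeric columns to [0, 1]; constant columns become 0.0 instead of NaN."""
    out = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    col_min = df[num_cols].min()
    col_range = df[num_cols].max() - col_min
    col_range = col_range.replace(0, 1)  # avoid 0/0 for constant columns
    out[num_cols] = (df[num_cols] - col_min) / col_range
    return out

# Toy frame with a text column, a regular column, and a constant column (assumption).
toy = pd.DataFrame({"name": ["a", "b", "c"],
                    "x": [10.0, 15.0, 20.0],
                    "const": [5.0, 5.0, 5.0]})
scaled = min_max_scale_safe(toy)
print(scaled)
```

Unlike the function above, this version keeps the original column names; non-numeric columns such as `Entity` pass through unchanged in both.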